Self-Indexing Based on LZ77

Author

  • Gonzalo Navarro
Abstract

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1–2 million characters of the text per second, and finds patterns at a rate of 10–50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
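To illustrate why repetitive collections compress so well under LZ77, the following is a minimal sketch of a greedy LZ77 parse in which each phrase copies the longest earlier-occurring string and then adds one explicit character. The function names `lz77_parse` and `lz77_decode` are hypothetical, the search is a naive brute-force scan suitable only for small inputs, and this is not the self-index described in the abstract, only the underlying parsing idea.

```python
def lz77_parse(text):
    """Greedy LZ77 parse (illustrative): list of (source, length, next_char)."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, 0
        # Try every earlier start position; the copied source may overlap position i.
        for j in range(i):
            k = 0
            # Stop one character early so an explicit trailing character always exists.
            while i + k < n - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # character-by-character copy handles overlaps
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "GATTACA" * 20
    parsed = lz77_parse(repetitive)
    print(len(repetitive), "characters,", len(parsed), "phrases")
    assert lz77_decode(parsed) == repetitive
```

On a text made of many near-identical copies, almost everything after the first copy is covered by a handful of long phrases, which is the source of compressibility that the abstract says classical self-indexes fail to capture.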


Similar resources

On compressing and indexing repetitive sequences

We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularl...


Differential Ziv-Lempel Text Compression

We describe a novel text compressor which combines Ziv-Lempel compression and arithmetic coding with a form of vector quantisation. The resulting compressor resembles an LZ-77 compressor, but with no explicit phrase lengths or coding for literals. An examination of the limitations on its performance leads to some predictions of the limits of LZ-77 compression in general, showing that the LZ-77 ...


Image Compression using Growing Self Organizing Map Algorithm

This paper presents a neural network based technique that may be applied to image compression. Conventional techniques such as Huffman coding, the Shannon-Fano method, the LZ method, the run-length method, and LZ-77 are more recent methods for the compression of data. A traditional approach to reduce the large amount of data would be to discard some data redundancy and introduce some noise after reconst...


Augmenting LZ-77 with authentication and integrity assurance capabilities

The formidable dissemination capability allowed by the current network technology makes it increasingly important to devise new methods to ensure authenticity and integrity. Nowadays it is common practice to distribute documents in compressed form. In this paper, we propose a simple variation on the classic LZ-77 algorithm that allows one to hide, within the compressed document, enough informat...


LZ complexity of chaotic dynamical systems and the quasiperiodic Fibonacci system

  The concept of LZ complexity originates in information science. Here we use this notion to characterize chaotic dynamical systems. We make contact with the usual characteristics of chaos, such as Lyapunov exponent and K-entropy. It is shown that for a two-dimensional system LZ complexity is as powerful as other characteristics. We also apply LZ complexity to the study of the quasiperiodic F...




Publication date: 2011